(57) Abstract:

PROBLEM TO BE SOLVED: To provide a method for summarizing a
compressed video.

SOLUTION: In this method, the intensity of motion activity is extracted
from the shots in the compressed video. Next, the intensity of motion
activity is used to divide the video into segments that are easy to
summarize and segments that are difficult to summarize. Each easy to
summarize segment is represented by an arbitrary frame selected from that
segment, whereas a sequence of frames is generated from each difficult to
summarize segment by a color based summarization process. The selected and
generated frames of each segment in each shot are combined to form the
summary of the compressed video.



LEGAL STATUS 

[Date of request for examination] 

[Date of sending the examiner's decision of rejection] 

[Kind of final disposal of application other than the examiner's
decision of rejection or application converted registration]

[Date of final disposal for application] 

[Patent number] 

[Date of registration] 

[Number of appeal against examiner's decision of rejection]

[Date of requesting appeal against examiner's decision of 
rejection] 

[Date of extinction of right] 




Copyright (C) 1998, 2003 Japan Patent Office



(19) Japan Patent Office (JP)
(12) Unexamined Patent Application Publication (A)
(11) Publication number: 2002-135804 (P2002-135804A)
(43) Date of publication: May 10, 2002 (2002.5.10)

(51) Int. Cl.7: H04N 11/04; G06T 7/20; H04N 5/92, 7/24
FI: H04N 11/04; G06T 7/20; H04N 7/13, 5/92
Theme codes (reference): 5C053, 5C057, 5C059, 5L096

(21) Application number: 2001-229656 (P2001-229656)
(22) Date of filing: July 30, 2001 (2001.7.30)
(31) Priority number: 09/634364
(32) Priority date: August 9, 2000 (2000.8.9)
(33) Priority country: United States (US)
(71) Applicant: 597067574
(74) Agent: 100057874




[Figures 1 to 6: drawings not reproduced in this text.]
(71) Applicant: 597067574
201 Broadway, Cambridge, Massachusetts 02139, U.S.A.

F-terms (reference):
5C053 GA11 GB09 GB19 GB30 GB37
5C057 DA06 EA06 EG08 EM04 FB03
5C059 MA00 MC32 NN21 PP06 PP07 PP16 PP26 TD10
5L096 AA02 AA06 BA20 FA23 HA02 HA04 JA11 KA09

1 Title of Invention 

Method for Summarizing a Video 
Using Motion and Color Descriptors 

2 Claims 

1. A method for summarizing a compressed video including motion and
color features, comprising: 

partitioning the compressed video into a plurality of shots; 

classifying each frame of each shot according to the motion features, a 
first class frame having relatively low motion activity and a second class 
frame having relatively high motion activity; 

grouping consecutive frames having the same classification into 
segments; 

selecting any one or more frames from each segment having the first 
classification; 

generating a sequence of frames from each segment having the second 
classification using the color features; and 

combining the selected and generated frames of each segment of each 
shot to form a summary of the compressed video. 

2. The method of claim 1 further comprising: 

combining the selected and generated frames in a temporal order. 

3. The method of claim 1 further comprising: 

combining the selected and generated frames in a spatial order. 

4. The method of claim 3 further comprising: 

reducing the selected and generated frames in size to form miniature
frames.

5. The method of claim 1 further comprising: 

combining the selected and generated frames in a semantic order. 

6. The method of claim 1 further comprising: 

grouping the frames of each segment having the second classification 
into clusters according to the color features; 

generating a cluster summary for each cluster; and 

combining the cluster summaries to form the generated sequences of 
frames. 

7. The method of claim 1 wherein the summary is produced while playing 
the video. 

8. The method of claim 1 wherein the summary is used to resummarize the
video.

3 Detailed Description of Invention
FIELD OF THE INVENTION

This invention relates generally to videos, and more particularly to
summarizing a compressed video.

BACKGROUND OF THE INVENTION 

It is desired to automatically generate a summary of video, and more 
particularly, to generate the summary from a compressed digital video. 

Compressed Video Formats 

Basic standards for compressing a video as a digital signal have been
adopted by the Moving Picture Experts Group (MPEG). The MPEG standards
achieve high data compression rates by developing information for a full
frame of the image only every so often. The full image frames, i.e.,
intra-coded frames, are often referred to as "I-frames" or "anchor frames,"
and contain full frame information independent of any other frames. Image
difference frames, i.e., inter-coded frames, are often referred to as
"B-frames" and "P-frames," or as "predictive frames," and are encoded between
the I-frames and reflect only image differences, i.e., residues, with respect
to the reference frame.

Typically, each frame of a video sequence is partitioned into smaller blocks
of picture element, i.e., pixel, data. Each block is subjected to a discrete
cosine transformation (DCT) function to convert the statistically dependent
spatial domain pixels into independent frequency domain DCT coefficients.
Respective 8x8 or 16x16 blocks of pixels, referred to as "macro-blocks," are
subjected to the DCT function to provide the coded signal.

The DCT coefficients are usually energy concentrated so that only a few of 
the coefficients in a macro-block contain the main part of the picture 
information. For example, if a macro-block contains an edge boundary of an 
object, then the energy in that block, after transformation, as represented by 
the DCT coefficients, includes a relatively large DC coefficient and 
randomly distributed AC coefficients throughout the matrix of coefficients. 
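
The energy concentration described above can be illustrated with a small sketch. It assumes an orthonormal two-dimensional DCT-II written out directly in pure Python; the function name `dct_2d` and the sample block are hypothetical and are not taken from any MPEG codec. For a perfectly flat 8x8 block, all of the energy ends up in the DC coefficient and every AC coefficient is zero up to rounding.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        # Normalization that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


# A flat 8x8 block (illustrative data): every pixel has the value 100.
flat_block = [[100] * 8 for _ in range(8)]
flat_coeffs = dct_2d(flat_block)

# coeffs[0][0] is the DC coefficient; all other entries are AC coefficients.
dc = flat_coeffs[0][0]
largest_ac = max(
    abs(flat_coeffs[u][v]) for u in range(8) for v in range(8) if (u, v) != (0, 0)
)
```

Real encoders use fast, integer-friendly DCT implementations; the quadruple loop here is only meant to make the definition visible.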

A non-edge macro-block, on the other hand, is usually characterized by a
similarly large DC coefficient and a few adjacent AC coefficients which are
substantially larger than other coefficients associated with that block. The
DCT coefficients are typically subjected to adaptive quantization, and then
are run-length and variable-length encoded. Thus, the macro-blocks of
transmitted data typically include fewer than an 8 x 8 matrix of codewords.

The macro-blocks of inter-coded frame data, i.e., encoded P or B frame data,
include DCT coefficients which represent only the differences between the
predicted pixels and the actual pixels in the macro-block. Macro-blocks of
intra-coded and inter-coded frame data also include information such as the
level of quantization employed, a macro-block address or location indicator,
and a macro-block type. The latter information is often referred to as
"header" or "overhead" information.

Each P-frame is predicted from the most recently occurring I- or P-frame.
Each B-frame is predicted from the I- or P-frames between which it is
disposed. The predictive coding process involves generating displacement
vectors, often referred to as "motion vectors," which indicate the magnitude
of the displacement to the macro-block of an I-frame that most closely
matches the macro-block of the B- or P-frame currently being coded. The pixel
data of the matched block in the I-frame is subtracted, on a pixel-by-pixel
basis, from the block of the P- or B-frame being encoded, to develop the
residues. The transformed residues and the vectors form part of the encoded
data for the P- and B-frames.
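
The subtraction that produces the residues can be sketched as follows. This is a simplified illustration under stated assumptions: frames are plain lists of rows of luminance values, the motion vector is a whole-pixel `(dy, dx)` offset, and the names `predicted_block` and `residue` are hypothetical. Real encoders also handle sub-pixel vectors and frame borders, which are omitted here.

```python
def predicted_block(reference, top, left, motion_vector, size):
    """Fetch the block of the reference frame that the motion vector points to."""
    dy, dx = motion_vector
    return [
        [reference[top + dy + r][left + dx + c] for c in range(size)]
        for r in range(size)
    ]


def residue(current, predicted):
    """Pixel-by-pixel difference between the current block and its prediction."""
    return [
        [cur - pred for cur, pred in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, predicted)
    ]


# Illustrative 4x4 reference frame whose pixel at (r, c) has the value 10*r + c.
reference = [[10 * r + c for c in range(4)] for r in range(4)]

# The 2x2 block at position (0, 0) of the frame being coded shows the content
# found at (1, 1) in the reference frame, except for one brighter pixel.
current_block = [[11, 12], [21, 24]]

pred = predicted_block(reference, top=0, left=0, motion_vector=(1, 1), size=2)
res = residue(current_block, pred)
```

Only `res` and the motion vector would need to be coded for this block, which is why well-matched blocks compress to very little data.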

Video Analysis 

Video analysis can be defined as processing a video with the intention of
understanding the content of a video. The understanding of a video can
range from a "low-level" syntactic understanding to a "high-level" semantic
understanding.

The low-level understanding can be achieved by analyzing low-level
features, such as color, motion, texture, shape, and the like. The low-level
features can be used to partition the video into "shots." Herein, a shot is
defined as a sequence of frames that begins when the camera is turned on
and lasts until the camera is turned off. Typically, the sequence of frames in
a shot captures a single "scene." The low-level features can be used to
generate descriptors. The descriptors can then be used to index the video,
e.g., an index of each shot in the video and perhaps its length.

A semantic understanding of the video is concerned with the genre of the
content, and not its syntactic structure. For example, high-level features
express whether a video is an action video, a music video, a "talking head"
video, or the like.



Video Summarization 

Video summarization can be defined as generating a compact representation 
of a video that still conveys the semantic essence of the video. The compact 
representation can include "key" frames or "key" segments, or a 
combination of key frames and segments. As an example, a video summary 
of a tennis match can include two frames, the first frame capturing both of 
the players, and the second frame capturing the winner with the trophy. A 
more detailed and longer summary could further include all frames that 
capture the match point. While it is certainly possible to generate such a 
summary manually, this is tedious and costly. Automatic summarization is 
therefore desired. 

Automatic video summarization methods are well known; see S. Pfeiffer et
al., "Abstracting Digital Movies Automatically," J. Visual Comm. Image
Representation, vol. 7, no. 4, pp. 345-353, December 1996, and Hanjalic et
al., "An Integrated Scheme for Automated Video Abstraction Based on
Unsupervised Cluster-Validity Analysis," IEEE Trans. on Circuits and
Systems for Video Technology, vol. 9, no. 8, December 1999.

Most known video summarization methods focus exclusively on color-based
summarization. Only Pfeiffer et al. have used motion, in combination with
other features, to generate video summaries. However, their approach
merely uses a weighted combination that overlooks possible correlation
between the combined features. Some summarization methods also use
motion features to extract key frames.

As shown in Figure 1, prior art video summarization methods have mostly
emphasized clustering based on color features, because color features are
easy to extract and robust to noise. A typical method takes a video A 101 as
input, and applies a color based summarization process 100 to produce a
video summary S(A) 102. The video summary consists of either a single
summary of the entire video, or a set of interesting frames.

The method 100 typically includes the following steps. First, cluster the
frames of the video according to color features. Second, arrange the clusters
in an easy-to-access hierarchical data structure. Third, extract a key frame or
a key sequence of frames from each of the clusters to generate the summary.

Motion Activity Descriptor 

A video can also be intuitively perceived as having various levels of activity
or intensity of action. An example of a relatively high level of activity is a
scoring opportunity in a sporting event video; a news reader video, on the
other hand, has a relatively low level of activity. The recently proposed
MPEG-7 video standard provides for a descriptor related to the motion
activity in a video.

SUMMARY OF THE INVENTION

It is an objective of the present invention to provide an automatic video 
summarization method using motion features, specifically motion activity 
features by themselves and in conjunction with other low-level features such 
as color and texture features. 

The main intuition behind the present invention is based on the following
hypotheses. The motion activity of a video is a good indication of the
relative difficulty of summarizing the video. The greater the amount of
motion, the more difficult it is to summarize the video. A video summary
can be quantitatively described by the number of frames it contains, for
example, the number of key frames, or the number of frames of key
segments.

The relative intensity of motion activity of a video is strongly correlated to
changes in color characteristics. In other words, if the intensity of motion
activity is high, there is a high likelihood that the change in color
characteristics is also high. If the change in color characteristics is high,
then a color feature based summary will include a relatively large number of
frames, and if the change in color characteristics is low, then the summary
will contain fewer frames.

For example, a "talking head" video typically has a low level of motion
activity and very little change in color as well. If the summarization is based
on key frames, then one key frame would suffice to summarize the video. If
key segments are used, then a one-second sequence of frames would suffice
to visually summarize the video. On the other hand, a scoring opportunity in
a sporting event would have very high intensity of motion activity and color
change, and would thus take several key frames or several seconds to
summarize.

More particularly, the invention provides a method that summarizes a video 
by first extracting intensity of motion activity from a video. It then uses the 
intensity of motion activity to segment the video into easy and difficult 
segments to summarize. 

Each easy to summarize segment is represented by a single frame, or by
selected frames, taken from anywhere in the segment. Any frame will do
because there is very little difference between the frames in an easy to
summarize segment. A color based summarization process is used to summarize
the difficult segments. This process extracts sequences of frames from each
difficult to summarize segment. The single frames and extracted sequences of
frames are combined to form the summary of the video.

The combination can use temporal, spatial, or semantic ordering. In a
temporal arrangement, the frames are concatenated in some temporal order,
for example first-to-last, or last-to-first. In a spatial arrangement, miniatures
of the frames are combined into a mosaic or some array, for example,
rectangular, so that a single frame shows several miniatures of the selected
frames of the summary. A semantically ordered summary might go from
most exciting to least exciting.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Our invention summarizes a compressed video using color and motion 
features. Therefore, our summarization method first extracts features from 
the compressed video. 

Feature Extraction 

Color Features 

We can accurately and easily extract DC coefficients of an I-frame using
known techniques. For P- and B-frames, the DC coefficients can be
approximated using motion vectors without full decompression; see, for
example, Yeo et al., "On the Extraction of DC Sequence from MPEG video,"
IEEE ICIP Vol. 2, 1995. The YUV value of the DC image can be
transformed to a different color space to extract the color features.

The most commonly used technique is the color histogram. Color histograms
have been widely used in image and video indexing and retrieval; see Smith
et al., "Automated Image Retrieval Using Color and Texture," IEEE
Transactions on Pattern Analysis and Machine Intelligence, November 1996.
Typically, in a three channel RGB color space, with four bins for each
channel, a total of 64 (4x4x4) bins are needed for the color histogram.
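
The 64-bin layout can be sketched as below. The sketch assumes that the DC image has already been converted to 8-bit RGB triples; the function name `rgb_histogram`, the bin indexing order, and the normalization to fractions are illustrative choices rather than details given in the specification.

```python
def rgb_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram with bins_per_channel**3 bins.

    `pixels` is an iterable of (r, g, b) tuples with 8-bit channel values.
    """
    hist = [0] * bins_per_channel ** 3
    count = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
        count += 1
    if count == 0:
        return [0.0] * len(hist)
    return [h / count for h in hist]


# Four illustrative pixels: one black, two white, one dark red.
sample_pixels = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (64, 0, 0)]
sample_hist = rgb_histogram(sample_pixels)
```

Normalizing by the pixel count makes histograms of differently sized images comparable, which matters when histograms are later compared frame to frame.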

Motion features 

Motion information is mostly embedded in motion vectors. Motion vectors
can be extracted from P- and B-frames. Because motion vectors are usually a
crude and sparse approximation to real optical flow, we only use motion
vectors qualitatively. Many different methods to use motion vectors are
known; see Tan et al., "A new method for camera motion parameter
estimation," Proc. IEEE International Conference on Image Processing, Vol.
2, pp. 722-726, 1995; Tan et al., "Rapid estimation of camera motion from
compressed video with application to video annotation," to appear in IEEE
Trans. on Circuits and Systems for Video Technology, 1999; Kobla et al.,
"Detection of slow-motion replay sequences for identifying sports videos,"
Proc. IEEE Workshop on Multimedia Signal Processing, 1999; Kobla et al.,
"Special effect edit detection using VideoTrails: a comparison with existing
techniques," Proc. SPIE Conference on Storage and Retrieval for Image and
Video Databases VII, 1999; Kobla et al., "Compressed domain video
indexing techniques using DCT and motion vector information in MPEG
video," Proc. SPIE Conference on Storage and Retrieval for Image and
Video Databases V, SPIE Vol. 3022, pp. 200-211, 1997; and Meng et al.,
"CVEPS - a compressed video editing and parsing system," Proc. ACM
Multimedia 96, 1996.

As stated above, most prior art summarization methods are based on 
clustering color features to obtain color descriptors. While color descriptors 
are relatively robust to noise, by definition, they do not include the motion 
characteristics of the video. However, motion descriptors tend to be less 
robust to noise, and therefore, they have not been as widely used for 
summarizing videos. 

U.S. Patent Application Sn. 09/406,444, "Activity Descriptor for Video
Sequences," filed by Divakaran et al., describes how motion features derived
from motion vectors in a compressed video can be used to determine motion
activity and the spatial distribution of the motion activity in the video. Such
descriptors are successful for video browsing applications. Now, we apply
such motion descriptors to video summarization.

We hypothesize that the relative level of activity in a video can be used to
measure the "summarizability" of the video. Unfortunately, there are no
simple objective measures to test this hypothesis. However, because changes
in motion often are accompanied by changes in the color characteristics, we
investigate the relationship between the relative intensity of motion activity
and changes in color characteristics of a video.

Motion and Color Changes

We do this by extracting the color and motion features of videos from the
MPEG-7 "test-set." We extract the motion activity features from all the
P-frames by computing the average of motion vector magnitudes, and a 64-bin
RGB histogram from all the I-frames. We then compute the change in the
histogram from I-frame to I-frame. We apply a median filter to the vector of
frame-to-frame color histogram changes to eliminate changes that
correspond to segment cuts or other segment transitions. We plot the
intensity of motion activity versus the median filtered color change for every
frame as shown in Figures 2 and 3.
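
The measurements described above can be sketched in a few lines. Several details are assumptions made for illustration only: motion vectors are given as `(dx, dy)` pairs, the histogram change is taken as the sum of absolute bin differences, and the median filter uses a three-sample window that is truncated at the ends of the sequence. The specification does not state which distance measure or window length was used.

```python
import math
from statistics import median


def motion_activity(motion_vectors):
    """Average magnitude of the motion vectors of one P-frame."""
    if not motion_vectors:
        return 0.0
    return sum(math.hypot(dx, dy) for dx, dy in motion_vectors) / len(motion_vectors)


def histogram_change(hist_a, hist_b):
    """Assumed dissimilarity: sum of absolute differences between bins."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def median_filter(values, window=3):
    """Sliding median; the window is truncated at both ends of the list."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# One P-frame with two motion vectors (illustrative data).
activity = motion_activity([(3, 4), (0, 0)])

# I-frame to I-frame color changes with one isolated spike, as a cut would cause.
raw_changes = [0.1, 0.1, 1.9, 0.1, 0.2]
filtered_changes = median_filter(raw_changes)
```

The isolated spike at index 2 disappears after filtering, which is the effect the median filter is used for here: a cut changes the histogram abruptly for a single measurement, whereas genuine activity changes it over several.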


Figures 2 and 3 respectively show the relationship between intensity of
motion activity and color dissimilarity for the "jornaldanoite1" and "news1"
test sets. There is a clear correlation between the intensity of motion
activity and the change in color. For low activity, it is very clear that the
change in color is also low. For higher activity levels, the correlation
becomes less evident, as there are many possible sources of high activity,
some of which may not result in color content change. However, when the
activity is very low, it is more likely that the content does not change
frame-to-frame. We use this information to pre-filter a video to detect
segments which are almost static; hence, these static segments can be
summarized by a single key frame. Based on these results we provide the
following summarization method.

Summarization Method

Figure 4 shows a method 400 for summarizing an input compressed video A
401 to produce a summary S(A) 402.

The input compressed video 401 is partitioned into "shots" using standard
techniques well known in the art, and as described above. By first
partitioning the video into shots, we ensure that each shot is homogeneous
and does not include a scene change. Thus, we will properly summarize a
video of, for example, ten consecutive different "talking head" shots that at a
semantic level would otherwise appear identical. From this point on the
video is processed in a shot-by-shot manner.

Step 410 determines the relative intensity of motion activity for each frame 
of each shot. Each frame is classified into either a first or second class. The 
first class includes frames that are relatively easy to summarize, and the 
second class 412 includes frames that are relatively difficult to summarize. 
In other words, our classification is motion based. 

Consecutive frames of each shot that have the same classification are
grouped into either an "easy" to summarize segment 411 or a "difficult" to
summarize segment 412.
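
Classification and grouping can be sketched as follows. The specification speaks only of "relatively" low and high motion activity, so the fixed `threshold` parameter is an assumption, as are the function names and the representation of a segment as a `(label, first_frame, last_frame)` tuple.

```python
from itertools import groupby


def classify_frames(activities, threshold):
    """Label each frame of a shot as 'easy' or 'difficult' by motion activity."""
    return ["easy" if a <= threshold else "difficult" for a in activities]


def group_segments(labels):
    """Group runs of consecutive, equally labeled frames into segments."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments


# Per-frame motion activity of one shot (illustrative data).
shot_activities = [0.1, 0.2, 3.0, 4.0, 0.1]
labels = classify_frames(shot_activities, threshold=1.0)
segments = group_segments(labels)
```

A threshold relative to the statistics of the shot or of the whole video could be substituted for the fixed value without changing the grouping step.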

For easy segments 411 of each shot, we perform a simple summarization
420 of the segment by selecting a key frame or a key sequence of frames 421
from the segment. The selected key frame or frames 421 can be any frame in
the segment because all frames in an easy segment are considered to be
semantically alike.

For difficult segments 412 of each shot, we apply a color based 
summarization process 500 to summarize the segment as a key sequence of 
frames 431. 

The key frames 421 and 431 of each shot are combined to form the summary
of each shot, and the shot summaries can be combined to form the final
summary S(A) 402 of the video.

The combination of the frames can use temporal, spatial, or semantic
ordering. In a temporal arrangement, the frames are concatenated in some
temporal order, for example first-to-last, or last-to-first. In a spatial
arrangement, miniatures of the frames are combined into a mosaic or some
array, for example, rectangular, so that a single frame shows several
miniatures of the selected frames of the summary. A semantic ordering
could be most-to-least exciting, or quiet-to-loud.
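
The three orderings can be sketched on frame indices alone. This is a hypothetical illustration: frames are represented by their index in the video, the mosaic is only a grid of indices (the down-scaling of frames to miniatures is omitted), and the `score` function used for the semantic ordering stands in for whatever measure of excitement or loudness an application would supply.

```python
def temporal_order(frames, reverse=False):
    """First-to-last ordering, or last-to-first when reverse is True."""
    return sorted(frames, reverse=reverse)


def spatial_mosaic(frames, columns):
    """Arrange frames row by row into a rectangular grid with `columns` columns."""
    ordered = sorted(frames)
    return [ordered[i:i + columns] for i in range(0, len(ordered), columns)]


def semantic_order(frames, score):
    """Order frames from the highest to the lowest score."""
    return sorted(frames, key=score, reverse=True)


# Frame indices of a summary and made-up excitement scores for three of them.
summary_frames = [5, 1, 3, 9, 7]
excitement = {1: 0.2, 3: 0.9, 5: 0.5}

first_to_last = temporal_order(summary_frames)
mosaic = spatial_mosaic(summary_frames, columns=2)
most_exciting_first = semantic_order([1, 3, 5], score=lambda f: excitement[f])
```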

Figure 5 shows the steps of a preferred color based summarization process
500. Step 510 clusters the frames of each difficult segment 412 according to
color features into clusters. Step 520 arranges the clusters as a hierarchical
data structure 521. Step 530 summarizes each cluster 511 of the difficult
segment 412 by extracting a sequence of frames from the cluster to
generate cluster summaries 531. Step 440 combines the cluster summaries to
form the key sequence of frames 431 that summarizes the difficult segment
412.
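
A minimal sketch of the clustering, per-cluster summarization, and combination steps is given below. It makes several simplifying assumptions that are not taken from the specification: a greedy "leader" clustering replaces whatever clustering method is actually preferred, the hierarchical data structure 521 is skipped, the distance is the sum of absolute bin differences, and each cluster summary is simply the first `length` frames of the cluster. All names are hypothetical.

```python
def l1_distance(hist_a, hist_b):
    """Sum of absolute differences between histogram bins."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def cluster_by_color(histograms, max_distance):
    """Greedy leader clustering of frame indices.

    A frame joins the first cluster whose first frame (its leader) is within
    max_distance; otherwise the frame starts a new cluster.
    """
    clusters = []
    for index, hist in enumerate(histograms):
        for cluster in clusters:
            if l1_distance(histograms[cluster[0]], hist) <= max_distance:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


def summarize_difficult_segment(histograms, max_distance, length=1):
    """Take up to `length` frames per cluster and merge them in temporal order."""
    picked = []
    for cluster in cluster_by_color(histograms, max_distance):
        picked.extend(cluster[:length])
    return sorted(picked)


# Two-bin histograms for four frames of a difficult segment (illustrative data).
segment_histograms = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = cluster_by_color(segment_histograms, max_distance=0.5)
key_sequence = summarize_difficult_segment(segment_histograms, max_distance=0.5)
```

With real data the histograms would be the 64-bin RGB histograms of the frames in the segment, and `max_distance` would need tuning for the material.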

This method is especially effective with news-video type sequences because
the content of the video primarily comprises low-action frames of
"talking-heads" that can be summarized by key frames. The color-based
clustering process 500 needs to be carried out only on sequences of frames
that have higher levels of action, and thus the overall computational burden
is reduced.

Figure 6 shows the summarization method 400 graphically. An input video 
601 is partitioned 602 into shots 603. Motion activity analysis 604 is applied 
to the frames of the shots to determine easy (e) and difficult (d) segments 
605. Key frames, segments, or shots 606 extracted 607 from easy segments 
are combined with color based summaries 608 derived from clustered color 
analysis 609 to form the final summary 610. 

In one application, the summary is produced dynamically from the 
compressed video so that the summary of the entire video is available to the 
viewer within minutes of starting to "play" the video. Thus, the viewer can 
use the dynamically produced summary to "browse" the video. 

Furthermore, based on the dynamically produced summary, the user can
request that certain portions be resummarized on-the-fly. In other words, as
the video is played, the user summarizes selected portions of the video to
various levels of detail, using the summaries themselves for the selection
process, perhaps using different summarization techniques for the different
portions. Thus, our invention provides a highly interactive viewing modality
that hitherto has not been possible with prior art static summarization
techniques.

Although the invention has been described by way of examples of preferred 
embodiments, it is to be understood that various other adaptations and 
modifications may be made within the spirit and scope of the invention. 
Therefore, it is the object of the appended claims to cover all such variations 
and modifications as come within the true spirit and scope of the invention. 

4 Brief Description of Drawings 

Figure 1 is a block diagram of a prior art video summarization method; 

Figures 2 and 3 are graphs plotting motion activity versus color changes for 
MPEG test videos; 

Figure 4 is a flow diagram of a video summarization method according to
the invention;

Figure 5 is a flow diagram of a color based summarization process according
to the invention; and

Figure 6 is a block diagram of a summarization method according
to the invention.

1 Abstract

A method extracts an intensity of motion activity from shots in a compressed
video. The method then uses the intensity of motion activity to segment the
video into easy and difficult segments to summarize. Easy to summarize
segments are represented by any frames selected from the easy to summarize
segments, while a color based summarization process generates
sequences of frames from each difficult to summarize segment. The selected
and generated frames of each segment in each shot are combined to form the
summary of the compressed video.



2 Representative Drawing Fig. 4 



